Methods and systems for predicting multi-variable outcomes

ABSTRACT

Systems, methods and recordable media for predicting multi-variable outcomes based on multi-variable inputs. Additionally, the models described can be used to predict the multi-variable inputs themselves, based on the multi-variable inputs, providing a smoothing function that acts as a noise filter. Both multi-variable inputs and multi-variable outputs may be simultaneously predicted, based upon the multi-variable inputs. The models find a critical subset of data points, or “tent poles,” to optimally model all outcome variables simultaneously and thereby leverage communalities among outcomes.

CROSS-REFERENCE

This application claims the benefit of U.S. Provisional Application No. 60/368,586, filed Mar. 29, 2002, which application is incorporated herein, in its entirety, by reference thereto.

FIELD OF THE INVENTION

The present invention relates to software, methods, and devices for evaluating correlations between observed phenomena and one or more factors having putative statistical relationships with such observed phenomena. More particularly, the software, methods, and devices described herein relate to the prediction of the suitability of new compounds for drug development, including predictions for diagnosis, efficacy, toxicity, and compound similarity, among others. The present invention may also be applicable in making predictions relating to other complex, multivariate fields, including earthquake predictions, economic predictions, and others. For example, the transmission of seismic signals through a particular fault may exhibit significant changes in properties prior to fault shifting. One could use the seismic transmissions of the many small faults that are always active near major fault lines.

BACKGROUND OF THE INVENTION

The application of statistical methods to the treatment of disease, through drug therapy, for example, provides valuable tools to researchers and practitioners for effective treatment methodologies based not only on the treatment regimen, but taking into account the patient profile as well. Using statistical methodologies, physicians and research scientists have been able to identify sources, behaviors, and treatments for a wide variety of illnesses. Thus, for example, in the developed world, diseases such as cholera have been virtually eliminated due in great part to the understanding of the causes of, and treatments for, these diseases using statistical analysis of the various risk and treatment factors associated with these diseases.

The statistical methods most widely used in the medical and drug discovery fields are generally limited to conventional regression methods which relate clinical variables obtained from patients being treated for a disease with the probable treatment outcomes for those patients, based upon data relating to the particular drug, drugs or treatment methodology being performed on that patient. For example, logistic regression methods are used to estimate the probability of defined outcomes as impacted by associated information. Typically, these methods utilize a sigmoidal logistic probability function (Dillon and Goldstein 1984) that is used to model the treatment outcome. The values of the model's parameters are determined using maximum likelihood estimation methods. The nonlinearity of the parameters in the logistic probability function, coupled with the use of the maximum likelihood estimation procedure, makes logistic regression methods complicated. Thus, such methods are often ineffective for complex models in which interactions among the various clinical variables being studied are present, or where multivariable characterizations of the outcomes are desired, such as when characterizing an experimental drug. In addition, the coupling of logistic and maximum likelihood methods limits the validation of logistic models to retrospective predictions that can overestimate the model's true abilities.

Such conventional regression models can be combined with discriminant analysis to consider the relationships among the clinical variables being studied to provide a linear statistical model that is effective to discriminate among patient categories (e.g., responder and non-responder). Often these models comprise multivariate products of the clinical data being studied and utilize modifications of the methods commonly used in the purely regression-based models. In addition, the combined regression/discriminant models can be validated using prospective statistical methods in addition to retrospective statistical methods to provide a more accurate assessment of the model's predictive capability. However, these combined models are effective only for limited degrees of interactions among clinical variables and thus are inadequate for many applications.

The Similarity Least Square Modeling Method (SMILES) disclosed in U.S. Pat. No. 5,860,917 (of which the present inventor is a co-inventor), and which is hereby incorporated, in its entirety, by reference thereto, is capable of predicting an outcome (Y) as a function of a profile (X) of related measurements and observations based on a viable definition of similarity between such profiles. SMILES fails, however, to provide a means to effectively handle multiple outcome variables or outcomes of different types. For multiple outcome variables, or Y-variables, SMILES analyzes each Y-variable separately as independent measurements or observations. Thus, one obtains a separate model for each Y-variable. When the Y-variables measure the same phenomena, they likely have induced interdependencies or communalities. It becomes difficult to perform analysis with separate independent models. Nuisance and noise factors complicate this task even further.

What is needed, therefore, are methods of providing statistically meaningful models for analyzing the Y-variables as an ensemble of related observations, to produce a common model for all Y-variables as a function of multiple X-variables and thereby obtain a more efficient model with better leverage on common phenomena and less noise.

SUMMARY OF THE INVENTION

The present invention includes systems, methods and recordable media for predicting multi-variable outcomes based on multi-variable inputs. In one aspect of the invention, a predictor model is generated by: a) defining an initial model as Model Zero and inputting Model Zero as the initial column(s) of a similarity matrix T; b) performing an optimization procedure (e.g., least squares regression or other linear regression procedure, nonlinear regression procedure, maximum entropy procedure, mini-max entropy procedure or other optimization procedure) to solve for matrix values of an α matrix, which is a transformation of outcome profiles associated with input profiles; c) calculating a residual matrix ε based on the difference between the actual outcome values and the predicted outcome values determined through a product of matrix T and matrix α; d) selecting a row of the residual matrix ε which contains an error value most closely matching a pre-defined error criterion; e) identifying a row from a matrix of the multivariable inputs which corresponds to the selected row from the residual matrix ε; f) calculating similarity values between the identified row and each of the rows in the matrix of the multivariable inputs, including the identified row with itself; g) populating the next column of similarity matrix T with the calculated similarity values if it is determined that such column of the identified row is not collinear or nearly collinear with Model Zero and columns of previously identified rows, the similarity values for which were used to populate such previous columns of similarity matrix T; and h) repeating steps b) through g) until a predefined stopping criterion has been reached.
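
By way of a non-limiting illustration, steps a) through h) may be sketched in Python with NumPy as follows. The constant-column Model Zero, the Gaussian similarity function, the row-average ensemble error, the R² limit of 0.95 (taken from the collinearity discussion in the detailed description below), and the fixed cap on the number of tent poles are all illustrative assumptions rather than requirements of the invention:

    import numpy as np

    def gaussian_similarity(x_i, x_j, decay=1.0):
        # One possible similarity measure; the specification permits many
        # others (Euclidean, Hamming, Tanimoto, Dice, etc.).
        return np.exp(-decay * np.sum((x_i - x_j) ** 2))

    def build_gsmiles_model(X, Y, max_poles=20, r2_limit=0.95):
        # Sketch of steps a) through h): grow similarity matrix T column by column.
        n = X.shape[0]
        T = np.ones((n, 1))                         # a) Model Zero: a constant column
        pole_rows = []
        for _ in range(max_poles):                  # h) illustrative stopping criterion
            alpha, *_ = np.linalg.lstsq(T, Y, rcond=None)  # b) solve for the alpha matrix
            eps = Y - T @ alpha                     # c) residual matrix
            ensemble = np.abs(eps).mean(axis=1)     # d) one ensemble-error option
            accepted = False
            for row in np.argsort(ensemble)[::-1]:  # d)/e) candidate rows, worst first
                # f) similarity between the identified row and every row (itself included)
                s = np.array([gaussian_similarity(X[row], X[j]) for j in range(n)])
                if s.std() == 0.0:
                    continue                        # a constant column duplicates Model Zero
                # g) collinearity filter against Model Zero and earlier columns
                r2 = max((np.corrcoef(s, T[:, k])[0, 1] ** 2
                          for k in range(T.shape[1]) if T[:, k].std() > 0.0),
                         default=0.0)
                if r2 < r2_limit:
                    T = np.column_stack([T, s])     # g) accept this tent pole
                    pole_rows.append(row)
                    accepted = True
                    break
            if not accepted:
                break                               # every candidate was collinear
        alpha, *_ = np.linalg.lstsq(T, Y, rcond=None)
        return T, alpha, pole_rows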

In another aspect of the present invention, the predictor model may be used to predict multi-variable outcomes for multi-variable input data for which the outcomes are not known.

In another aspect of the present invention, the model learns to represent a process from process profile data such as process input, process output, process parameters, process controls and/or process metrics, so that the trained model is useful for process optimization, model-based process control, statistical process control and/or quality assurance and control.

In another aspect of the present invention, a model may be used to self-predict multi-variable profiles, wherein the input multivariable profiles are used to predict the input multivariable profiles themselves as multi-variable outputs.

In another aspect of the present invention, the self-prediction model is used iteratively to impute data values to missing data values in the multivariable input profiles.

In another aspect of the present invention, a model is used to simultaneously predict both multi-variable X-input profiles and multi-variable Y-output profiles based on the multi-variable X-input profiles.

In another aspect, Y-columns may be similarity values of a select subset of the original Y-variables, by analogy to S-columns as similarity values of the X-variables.

In another aspect of the present invention, score functions may be optimally assigned to the predicted multi-variable outcomes for use in any multivariate distribution process, such as ordinal, logistic, and survival probability analysis and predictions.

In yet another aspect, the identified rows, also described as math-functional “tent pole” locations, may be tested for ellipticity as a function of the X-space, using the Marquardt-Levenberg algorithm, and then ranked according to the testing.

Still further, the present invention may include determining one or more decay constants for each of the identified rows of X-profiles (tent pole locations) used to calculate similarity values to populate the T matrix (similarity matrix).

Methods, systems and recordable media are disclosed for generating a predictor model for predicting multi-variable outcomes (a matrix of rows of Y-profiles) based upon multivariable inputs (a matrix of rows of X-profiles) with consideration of nuisance or noise variables, by analyzing each X-profile row of multivariable inputs as an object; calculating similarity among the objects; selecting tent pole locations determined to be critical profiles in supporting a prediction function for predicting the Y-profiles; determining a maximum number of such profiles by model properties such as collinearity or maximum fit error or least squares sum of squared errors; and optimizing the final number of tent poles by prospective “true” prediction properties such as the minimum of the sum of squared prospective errors or ensemble errors between the Y-profile predictions and the known Y-profile value(s).

According to the present invention, the dimensions of the data can be reduced to a lower dimension defined only by the critical components necessary to represent the phenomenon being modeled. Hence, in general, the present invention is valuable in helping researchers “see” the high-dimensional patterns in limited noisy data on complex phenomena that can involve multiple inputs and multiple consequential outputs (e.g., outcomes or responses).

The present invention can optimize the model fit and/or the model predictions and provides diagnostics that measure the predictive and fit capabilities of a derived model. Input profile components may simultaneously be included as outcome variables and vice versa, thus enabling a nonlinear version of partial least squares that induces proper matrix-eigenvalue matching between input and output matrices. Eigenvalue matching is well-practiced as linear transformations related to generalized singular value decompositions (GSVD). The present invention can also be used for self-prediction imputation and smoothing, e.g., predicting smoothed and missing values in input data based on key profiles in the input data.

The present invention includes the capability to measure the relative importance of individual input variables to the prediction and fit process by nonlinear statistical parameters calculated by the Marquardt-Levenberg algorithm. The present invention can also associate decay constants with each location (tent pole), which is useful to quantify the types and scopes of the influence of that profile on the model, i.e., local and/or global effect.

The present invention finds a critical subset of data points to optimally model all outcome variables simultaneously to leverage both communalities among outcomes and uniqueness properties of each outcome. The method relates measured variables associated with a complex phenomenon using a simple direct functional process that eliminates artifactual inferences even if the data is sparse or limited and the variable space is high dimensional. The present invention can also be layered to model higher-ordered features, e.g., output of a GSMILES network can be input to a second GSMILES network. Such GSMILES networks may include feedback loops. If profiles include one or more ordered indices such as “time,” GSMILES networks can incorporate the ordering of such indices (i.e., “time” series). GSMILES also provides statistical evaluations and diagnostics of the analysis, for both retrospective and prospective scenarios. GSMILES reduces random noise by combining data from replicate and nearby adjacent information (i.e., pseudo-replicates).

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an architecture diagram showing examples of input sources that may supply data to the predictor system according to the present invention.

FIG. 2 is a schematic diagram illustrating the ability of GSMILES to relate Y-profiles to X-profiles through an X-profile similarity map that performs nonlinear-X transforms of strategic Y-profiles. The similarity matrix, assuming no Model Zero (i.e., null Model Zero), is renormalized so that each row becomes a vector of convex coefficients, i.e., whose sum equals one with each coefficient in the interval [0,1].

FIG. 3 is an example matrix containing a training set of X-profiles, Y-profiles, and a noise or nuisance profile used by GSMILES in forming a predictor inference model. Such a nuisance profile can represent many variables, i.e., a vector of noise factors, usually with specifics unknown.

FIG. 4 is a diagram of a function 400 shown in a three-dimensional space, illustrating support locations along the function that can be “supported” by critical values (or profiles, i.e., the locations for the alpha coefficients representing the size and direction of the “tent pole”) in the X-Y space.

FIG. 5 illustrates an example of an initial model (Model Zero) used to solve for the critical profiles; in the example shown, the first critical profile or tent pole is being solved for.

FIG. 6 shows the error matrix resulting from processing, using the example shown in FIG. 5.

FIG. 7 shows a second iteration, following the example of FIGS. 5 and 6, used to solve for the second tent pole.

FIG. 8 shows an example of a test X-profile being inputted to GSMILES in order to predict a Y-profile for the same.

FIG. 9 is a flow chart showing one example of an iterative procedure employed by GSMILES in determining a predictor model.

FIG. 10 is a flow chart representing some of the important process steps in one example of an iterative algorithm that the present invention employs to select the columns of a similarity matrix.

FIG. 11 is a graph plotting the maximum absolute (ensemble) error versus the number of tent poles used in developing a model (training or fit error versus the number of tent poles).

FIG. 12 is a graph plotting the square root of the sum of the squared LOO errors divided by the number of terms squared against the number of tent poles, as a measure of test or validation error.

DETAILED DESCRIPTION OF THE INVENTION

Before the present invention is described, it is to be understood that this invention is not limited to the particular statistical methods described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.

Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and systems similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and systems are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or systems in connection with which the publications are cited.

It must be noted that as used herein and in the appended claims, the singular forms “a”, “and”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a variable” includes a plurality of such variables and reference to “the column” includes reference to one or more columns and equivalents thereof known to those skilled in the art, and so forth.

The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates, which may need to be independently confirmed.

Definitions

“Microarrays” measure the degree to which genes are expressed in a particular cell or tissue. One-channel microarrays attempt to estimate an absolute measure of expression. Two-channel microarrays compare two different cell types or tissues and output a measure of relative strength of expression.

“RTPCR” designates Real-Time Polymerase Chain Reaction, and includes techniques such as Taqman™, for example, for high resolution gene expression profiling.

“Bioassays” are experiments that determine properties of biological systems and measure certain quantities. Microarrays are an example of bioassays. Other bioassays are fluorescence assays (which cause a cell to fluoresce if a certain biological event occurs) and yeast two-hybrids (which determine whether two proteins of interest bind to each other or not).

“Chemical data” include the chemical structure of compounds, chemical and physical properties of compounds (such as solubility, pH value, viscosity, etc.), and properties of compounds that are of interest in pharmacology, e.g., toxicity for particular tissues in particular species, etc.

“Process control” includes all methods such as feed-forward, feed-backward, and model-based control loops and policies used to stabilize, reduce noise in, and/or control any process (e.g., production lines in factories), based on inherent correlations between systematic components and noise components of the process.

“Statistical process control” refers to statistical evaluation of process parameters and/or process-product parameters to verify process stability and/or product quality based on non-correlated noise.

“Genomics databases” contain nucleotide sequences. Nucleotide sequences include DNA (the information in the nucleus of eukaryotes that is propagated in cell division and is the basis for transcription), messenger RNA (the transcripts that are then translated into proteins), and ribosomal and transfer RNA (part of the translation machinery).

“Proteomics databases” contain amino acid sequences, both sequences inferred from genomic data and sequences found through various bioassays and experiments that reveal the sequences of proteins and peptides.

“Publications” include MEDLINE (the collection of biomedical abstracts distributed by the National Library of Medicine), biomedical journals, journal articles from related fields, such as chemistry and ecology, or articles, books or any other published material in the field being examined, whether it be geology, economics, etc.

“Patent” includes U.S. patents and patents throughout the world, as well as pending patent applications that are published.

“Proprietary documents” include those documents which have not been published, or are not intended to be published.

“Medical data” include all data that are generated by diagnostic devices, such as urinalysis and blood tests, and data generated by devices that are currently under investigation for their diagnostic potential (e.g., microarrays, mass spectroscopy data, etc.).

“Patient records” are the records that physicians and nurses maintain to record a patient's medical history. Increasingly, information is captured electronically as a patient interacts with hospitals and practitioners. Any textual data captured electronically in this context may be part of patient records.

When one location is indicated as being “remote” from another, this refers to two locations which are at least in different buildings, and these locations may be at least one mile, ten miles or at least one hundred miles apart.

“Transmitting” information refers to sending the data representing that information as electrical signals over a suitable communication channel (e.g., a private or public network).

“Forwarding” a result refers to any means of getting that result from one location to the next, whether by transmitting data representing the result or physically transporting a medium carrying the data or communicating the data.

A “result” obtained from a method of the present invention includes one directly or indirectly obtained from use of the present invention. For example, a directly obtained “result” may include a predictor model generated using the present invention. An indirectly obtained “result” may include a clinical diagnosis, treatment recommendation, or a prediction of patient response to a treatment which was made using a predictor model which was generated by the present invention.

The present invention provides methods and systems for extracting meaningful information from the rapidly growing amount of genomic and clinical data, using sophisticated statistical algorithms and natural language processing. The block diagram in FIG. 1 illustrates an exemplary architecture of a predictor system 100 according to one embodiment of the present invention. The predictor system 100 takes input from various sources (such as microarrays 102, bioassays 104, chemical data 106, genomics/proteomics 108, publications/patents/proprietary documentation 110, medical data 112, and patient records 114, as indicated in FIG. 1) and preprocesses the input using one or more of the ETL (Extraction/Transformation/Loading module, a standard data mining module for getting the data into a workable format) 120, text mining 122, Blast 124, and data interpretation 126 modules.

The ETL module 120 extracts data relating to one or more entities (e.g., compounds) from a data source. The extracted data correspond to input and output variables to be used in the GSMILES model for the particular compound. Examples of data extraction and manipulation tasks supported by the ETL module include XML parsing; recognizing various column and row delimiters in unstructured files; and automatic recognition of the structure of a file (e.g., XML, unstructured, or some other data exchange format).

Once the ETL module extracts the data, it may transform the data with simple preprocessing steps. For example, the ETL module may normalize the data and filter out noise and non-relevant data points. The ETL module then loads the data into the RDBMS (i.e., relational database management system) in a form that is usable in the GSMILES process, e.g., the input and output variables according to the GSMILES model. Specifically, the ETL module loads the extracted (and preferably preprocessed) data into the RDBMS in fields corresponding to the input and output variables for the entities to which the data relate.

The ETL module may be run in two modes. If a data source is available permanently, data are processed in batch mode and stored in the RDBMS. If a data source is interactively supplied by the user, it will be processed interactively by the ETL module.

The text mining module 122 processes textual input from sources such as publications 110 and patient records 114. Text mining module 122 produces two types of outputs: structured output stored in the database 130, and unstructured keyword vectors stored in an inverted index (Text Index) 132. Unlike a conventional inverted index, Text Index 132 also preferably functions to retrieve pre-computed keyword vectors. This is important for text types such as patient records.

In one embodiment, text mining module 122 includes three components: a term matching component (including specialized dictionaries and regular expression parsers for mapping text strings to entities in an underlying ontology); a relationship mapping component (including patterns that occur in general language as well as patterns that are specific to the domain) for recognizing relationships between entities in text (such as drug-protein interactions and gene-disease causal relationships); and a learning component which learns terms and relationships based on an initial set of terms and relationships supplied by a domain expert.

In one embodiment, text mining module 122 uses techniques taught by the FASTUS (Finite State Automaton Text Understanding System) system, developed by SRI International, Menlo Park, Calif. These techniques are described in Hobbs et al., “FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text”, which can be found at http://www.ai.sri.com/natural-language/projects/fastus.html, and which is incorporated herein, in its entirety, by reference thereto. Text mining techniques are well-known in the art, and a comprehensive discussion thereof can be found in the textbook by Christopher D. Manning & Hinrich Schütze, Foundations of Statistical Natural Language Processing (MIT Press: 1st ed., 1999).

The Blast or Homology module 124 detects sequence data in data sources (e.g., microarrays 102, patents 110, patient records 114, etc.), and stores them in a unified format such as FASTA. The Homology module 124 uses BLAST or other known sequence identification methods. Homology module 124 is called interactively for sequence similarity computation by GSMILES 140 (if sequence similarity is part of the overall similarity between data points computed).

Data interpretation module 126 performs a number of tasks that go beyond the more mechanical processing done by ETL module 120. One of the tasks performed by data interpretation module 126 is that of imputation, where missing data are filled in, where possible, using GSMILES processing. Another function of the data interpretation module is data linkage. If the same data type occurs in several sources, but under different names, then data interpretation module 126 reconciles the apparent disparity offered by the different names by linking these terms (e.g., such as when different naming conventions are used for drugs or genes).

Client 150 allows a user to interact with the system 100. In data source selection, the user selects which data sources are most important for a particular prediction task. If a new data source has become available, the user may add the new data source to the system 100. Weighting may be employed to determine the relative significance, or weight, of various data sources. For example, if a user has prior knowledge indicating that most of the predictive power comes from microarrays for a particular classification task, then the user would indicate this with a large weighting factor applied to the microarrays data source.

The client 150 performs output function selection when the user selects one or more particular output categories of interest (i.e., the response variables). When a response variable is used for the first time, the user needs to make it accessible to the system and configure it (e.g., the user determines what kind of response variable it is, such as continuous, dichotomous, polytomous, etc.).

By processing the preprocessed data received from ETL 120, text mining 122, Blast 124 and/or data interpretation 126 modules to arrive at predictive values according to the selected output function or functions, GSMILES 140 may provide valuable predictive information as to compound similarities 152, toxicity 154, efficacy 156, and diagnosis 158, but is not limited to such output functions, as has been noted earlier.

Information may be exchanged with Text Index 132.

Module(s) 120, 122, 124 and/or 126 exchange(s) data with RDBMS 130 and/or Text Index 132, as described above. The preprocessed data from module(s) 120, 122, 124 and/or 126 are fed into the GSMILES (Generalized Similarity Least Squares Modeling Method) predictor module 140, which again exchanges data with Text Index 132 and RDBMS 130, but also takes input from client 150, for example, as to data source selection, weighting of data points, and output function selection. The output from GSMILES 140 may include predictions for various compounds of diagnosis, efficacy, toxicity, and compound similarity, among others.

One important aspect of the methods and systems disclosed concerns their use in the prediction of the suitability of new compounds for drug development. GSMILES predictor 140 may predict various aspects of a compound, such as toxicity, mode of action, indication and drug success, as well as consideration of similar compounds, while accepting user input to the various corresponding models. The sum of all the prediction results can be used at the end to decide which compound to pursue. By predicting a compound's mode of action, toxicology, and other attributes, the present invention facilitates lead prioritization and helps design experiments.

The present system may utilize the Generalized Similarity Least Squares (GSMILES) modeling method to reveal association patterns within genomic, proteomic, clinical, and chemical information and predict related outcomes such as disease state, response to therapy, survival time, toxic events, genomic properties, immune response/rejection level, and measures of kinetics/efficacy of single or multiple therapeutics. The GSMILES methodology performed by GSMILES module 140 is further discussed in the next section. Other possible applications of GSMILES include economic predictions, early detection of critical earthquake-related processes from appropriately filtered seismic signals and other geophysical measurements, and process models for process control of complex chemical processes to improve efficiency and protect the environment.

The GSMILES Methodology

A useful method and system for extracting meaningful information from the genomic and clinical data requires an efficient algorithm, an effective model, helpful diagnostic measures and, most importantly, the capability to handle multiple outcomes and outcomes of different types. The ability to handle multiple outcomes and outcomes of different types is necessary for many types of complex modeling. For example, genomic and clinical data are typically represented as related series of data values or profiles, requiring a multi-variate analysis of outcomes.

The Similarity Least Square Modeling Method (SMILES) disclosed in U.S. Pat. No. 5,860,917 (of which the present inventor is a co-inventor, and which was incorporated by reference above) is capable of predicting an outcome (Y) as a function of a profile (X) of related measurements and observations based on a viable definition of similarity between such profiles. SMILES fails, however, to provide a means to effectively handle multiple outcome variables or outcomes of different types. For multiple outcome variables, or Y-variables, SMILES analyzes each Y-variable separately as independent measurements or observations. Thus, one obtains a separate model for each Y-variable. When the Y-variables measure the same phenomena, they likely have induced interdependencies or communalities. It becomes difficult to perform analysis with separate independent models. Nuisance and noise factors complicate this task even further.

GSMILES remedies this deficiency by analyzing the Y-variables as an ensemble of related observations. GSMILES produces a common model for all Y-variables as a function of multiple X-variables to obtain a more efficient model with better leverage on common phenomena with less noise. This aspect of GSMILES allows a user to find strategic gene-compound associations that involve multiple-X/multiple-Y variables on noisy cell functions or responses to stimuli.

GSMILES treats each profile of associated measurements of variables as an object with three classes of information: predictor/driver variables (X-variables), predictee/consequential variables (Y-variables), and nuisance variables (noise variables, known and unknown). Note that these classes are not mutually exclusive; hence, a variable can belong to one or more of such GSMILES classes as dictated by each application.

GSMILES calculates similarity among all such objects using a definition of similarity based on the X-variables. Note that similarity may be compound, e.g., a combination of similarity measures, where each similarity component is specific to a subset of profile X-variables. GSMILES uses such similarity values to predict the Y-variables. It selects a critical subset of objects that can optimally predict the Y-values of all objects within the precision limitations imposed by nuisance effects, assured by statistically valid criteria. An iterative algorithm as discussed below may make the selection.

Affine prospective predictions of Y-profiles may be performed to predict profiles (i.e., row vectors) in the Y-outcome-variable matrix 340 using matched profiles in X-input-variable matrix 240 (see FIG. 2). For simplicity, assume use of a null Model Zero. GSMILES 140 processes the function:

Z=SR  (1)

where Z is an N×M matrix of predicted Y values (where N and M are positive integers);

S is an N×P matrix of similarity values between profiles in matrix X (where N and P are positive integers); S may further include one or more columns of Model Zero values, as will be discussed below; and

R is an X-nonlinear transformation of P Y-profiles associated with P strategic X-profiles (also referred to as “α” values, below).

The final prediction model according to this methodology is prospective, since each predicted row of Y in turn is used to estimate a prospective error, the sum of squares of which determines the optimal number of model terms by minimization. The transforms are optimized to minimize the least-squares error between Z and Y. Thus, R is a P×M matrix of P optimal transforms of Y-profiles, and the similarity values in each row of S are the strategic affine coefficients for these optimal profiles to predict the associated row in Y. In this way, GSMILES not only represents Y efficiently, but reduces noise by functional smoothing.

Equation (1) can be easily transformed into a mixture representation by normalizing each row of S to sum to unity, as follows:

DZ=DSR  (2)

where D is a diagonal matrix of the inverse of the sum of each row of matrix S.
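
As a non-limiting numerical illustration of equations (1) and (2), R may be obtained by an ordinary least-squares solve, and applying D amounts to dividing each row of S by its sum; the shapes and random data below are purely illustrative:

    import numpy as np

    # Illustrative shapes: N=6 objects, P=3 strategic profiles, M=2 outcomes.
    rng = np.random.default_rng(0)
    S = rng.random((6, 3))                      # N x P similarity values (nonnegative)
    Y = rng.random((6, 2))                      # N x M outcome profiles

    R, *_ = np.linalg.lstsq(S, Y, rcond=None)   # R minimizes the error between SR and Y
    Z = S @ R                                   # equation (1): Z = SR

    D = np.diag(1.0 / S.sum(axis=1))            # inverses of the row sums of S
    # Equation (2): DZ = DSR; each row of DS sums to one, i.e., is a vector
    # of convex coefficients (cf. the FIG. 2 discussion above).
    assert np.allclose((D @ S).sum(axis=1), 1.0)
    DZ = D @ S @ R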

The GSMILES methodology finds the strategic locations in matrix X 240 and determines P to optimize the prospective representation of the Y-profiles 340, including optimization of relationships within the Y-profiles.

Referring to FIG. 3, GSMILES arranges the X-profile and Y-profile, and also a noise profile 440, in a matrix 300. Noises are like hidden variables: they are ever present, but it is not known how to extract the values of these variables. All inference models must accommodate noise. Each row of matrix 300 represents a series of values for related variables, e.g., the X-values for row 1 of the matrix could be known, measured, or inputted values (or may even be dummy variables) which directly affect the Y-values of row 1, which can be thought of as output or outcome values, and wherein the N_(o)-values (noise) represent the noise values associated with each row. The left side 240 of the rows of matrix 300, which is populated by the X variables in FIG. 3, defines the X-profile of the problem, and the right side (340, 440) of the rows of matrix 300, which is populated by the Y and N_(o) variables in FIG. 3, defines the Y-profile and noise associated with the rows.

Each row of matrix 300 may be treated as a data object, i.e., an encapsulation of related information. The GSMILES methodology analyzes these objects and compares them with some measure of similarity (or dissimilarity). A fundamental underlying assumption of the GSMILES methodology is that if the X values are close in similarity, then the Y-values associated with those rows will also be close in value. By processing the objects in the matrix 300, a similarity transform matrix may be constructed using similarity values between selected rows of the X-profile, as will be described in more detail below. The X-profile objects (rows) are used to determine similarity among one another to produce similarity values used in the similarity transform matrix. Similarity between rows may be calculated by many different known similarity algorithms, including, but not limited to, Euclidean distance, Hamming distance, Minkowski weighted distance, or other known distance measurement algorithms. The normalized Hamming function measures the number of bits that are dissimilar in two binary sets. The Tanimoto or Jaccard coefficient measures the number of bits shared by two molecules relative to the ones they could have in common. The Dice coefficient may also be used, as well as similarity metrics between images or signal signatures when the input contains images or other signal patterns, as known to those of ordinary skill in the art.
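
The similarity measures named above may be made concrete as follows; these are standard definitions, sketched under the assumption that the binary measures receive bit-vectors and the continuous measure receives numeric vectors:

    import numpy as np

    def euclidean_similarity(a, b, decay=1.0):
        # Euclidean distance turned into a similarity by exponential decay
        # (one common choice; any monotone decreasing map would serve).
        return np.exp(-decay * np.linalg.norm(np.asarray(a) - np.asarray(b)))

    def normalized_hamming_similarity(a, b):
        # Fraction of bit positions on which two binary sets agree.
        a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
        return 1.0 - np.mean(a != b)

    def tanimoto_similarity(a, b):
        # Bits shared by two molecules relative to the bits they could share.
        a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
        shared, possible = np.sum(a & b), np.sum(a | b)
        return shared / possible if possible else 1.0

    def dice_similarity(a, b):
        # Dice coefficient: twice the shared bits over the total bits set.
        a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
        total = int(a.sum()) + int(b.sum())
        return 2.0 * np.sum(a & b) / total if total else 1.0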

With any set of data being analyzed, such as the data in matrix 300, for example, it has been found that certain select X-profiles among the objects are more critical in defining the relationship of the function sought than are the remainder of the X-profiles. GSMILES solves for these critical profiles that give critical information about the relationship between the X values and the Y values.

Conceptually speaking, if a function 400 is observed in a three-dimensional space, as shown in FIG. 4, there are certain domain locations of the function identifying features that can be “supported” by nearby critical data values (or profiles) in the X-Y space. For example, the points 410 and 420 in FIG. 4 are such critical values in the X-Y space. When these locations become the centroids of support for the range of the function, as facilitated by similarity functions, they tend to adequately support the total surface shape of the range of the function. Because of the appearance of this conceptual model, where the function range appears somewhat like a circus tent, and the critical domain locations, together with their extended impact, appear as tent poles, the present inventors refer to the critical profiles as “tent poles”. Of course, these “tent poles” can be positive or negative as applied to a mathematical function. This same concept applies to high dimensional problems and functions. GSMILES calculates the critical profiles, which define the locations of the “tent poles”, as well as their optimized coefficients (i.e., length or size of the tent poles).

To solve for the critical profiles, an initial model (called Model Zero (Model 0)) is inputted to the system, in matrix T (see FIG. 5). Model Zero (designated as μ₀ in FIG. 5) may be a conventional model, conceptual model, theoretical model, an X-profile with known Y-profile outcomes, or some other reasonable model which characterizes a rough approximation of the association between the X- and Y-profiles, but still cannot explain or account for many of the systematic patterns affecting the problem. Thus, Model Zero predicts Y (i.e., the Y values in the Y-profile), but not adequately. Alternatively, a null set could be used as Model Zero, or a column of equal constants, such as a column with each row in the column being the value 1 (one).

A least squares regression algorithm is next performed to solve for coefficients α₀ (see matrix α, FIG. 5) which will provide a best fit for the use of Model Zero to predict the Y-profiles, based on the known quantities in matrix μ₀ and matrix 340. It should be noted here that this step of the present invention is not limited to solving by least squares regression. Other linear regression procedures, such as median regression, ordinal regression, distributional regression, survival regression, or other known linear regression techniques may be utilized. Still further, nonlinear regression procedures, maximum entropy procedures, mini-max entropy procedures or other optimization procedures may be employed. Solving for the matrix α optimizes Model Zero to predict the Y-profile 340. Then the prediction errors (residuals) are calculated as follows:

Y−(T·α)=ε  (3)

where

Y=matrix 340;

α=α matrix (which is a 1×M vector in the example shown in FIG. 5);

T=the T matrix (i.e., vector, in this example, although the Model Zero profile may be a matrix having more than one column); and

ε=error matrix, or residuals, in this example characterizing Model Zero with ε₀ values.

The error matrix ε resulting from processing, using the example shown in FIG. 5, is shown in FIG. 6. Next, GSMILES determines the row of the ε matrix which has the maximum absolute value of error. Note that for problems where the Y-profile is a vector (i.e., an N×1 matrix, i.e., where M=1), the error matrix ε will be a vector (i.e., an N×1 matrix) and the maximum absolute error can be easily determined by simply picking the largest absolute value in the error vector. For the example shown in FIG. 5, however, the error matrix ε is an N×M matrix, as shown in FIG. 6. To determine maximum values in a matrix of error values, such as matrix ε, different options are available. The simplest approach, while not necessarily achieving the best results of all the approaches, is to simply pick the maximum absolute error value from the entire set of values displayed in matrix ε. Another approach is to construct an ensemble error for each row of error values in matrix ε. One way of constructing the ensemble errors is to calculate an average error for each entire row. This results in an error vector, from which the maximum absolute error can be chosen.
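
The two options just described, picking the row holding the single largest entry of ε versus ranking rows by an ensemble error, differ only in how each row of ε is collapsed to a single score. A minimal sketch, with the row-average rule as the illustrative ensemble choice:

    import numpy as np

    def worst_error_row(eps, ensemble=True):
        # eps is the N x M residual matrix of equation (3).
        if ensemble:
            scores = np.abs(eps).mean(axis=1)  # average absolute error per row
        else:
            scores = np.abs(eps).max(axis=1)   # row holding the single largest entry
        return int(np.argmax(scores))          # index of the selected row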

Whatever technique is used to determine the maximum absolute error, the row in which the maximum absolute error occurs is noted and used to identify the row (X-profile) from matrix 240 from which similarity values are calculated. The calculated similarity values are used to populate the next column of values in the matrix containing Model Zero. For example, at this stage of the processing, the similarity values will be used to populate the second column of the matrix, adjacent the Model Zero values. However, this is an iterative process which can be used to populate as many columns as necessary to produce a “good or adequate fit”, i.e., to refine the model so that it predicts Y-profiles within acceptable error ranges. An acceptable error range will vary depending upon the particular problem that is being studied, and the nature of the Y-profiles. For example, a model to predict temperatures may require predictions within an error range of ±1° C. for one application, while another application for predicting temperature may require predictions within an error range of ±0.01° C. GSMILES is readily adaptable to customize a model to meet the required accuracy of the predictions that it produces.

Assuming, for exemplary purposes, that the row in which the maximum absolute error was found in matrix ε was the seventh, GSMILES then identifies the seventh row in matrix 240 from which to perform the similarity calculations. Similarity calculations are performed between the seventh X-profile and each of the other X-profile rows, including the seventh row X-profile with itself. For example, the first row similarity value in column 2, FIG. 7 (i.e., S_(7,1)) is populated with the similarity value calculated between rows 7 and 1 of the X-profile matrix 240. The second row similarity value in column 2, FIG. 7 is populated with the similarity value S_(7,2), the similarity value calculated between rows 7 and 2, and so forth. Note that row 7 is populated with a similarity value calculated between row 7 and itself. This will be the maximum similarity value, as a row is most similar to itself and any replicate rows. The similarity values may be normalized so that the maximum similarity value is assigned a value of 1 (one) and the least similar value would in that case be zero. As noted, row 7 was only chosen as an example, but analogous calculations would be performed with regard to any row in the matrix 240 which was identified as corresponding to the highest maximum absolute error value, as would be apparent to those of ordinary skill in the art. It is further noted that selection does not have to be based upon the maximum absolute error value, but may be based on any predefined ensemble error scoring. For example, an ensemble average absolute error, ensemble median absolute error, ensemble mode absolute error, ensemble weighted average absolute error, ensemble robust average absolute error, geometric average, ensemble error divided by standard deviation of errors of ensemble, or other predefined absolute error measure may be used in place of the maximum absolute error or maximum ensemble absolute error.

The X-profile row selected for calculating the similarity values marksthe location of the first critical profile or “tent pole” identified byGSMILES for the model. A least squares regression algorithm is againperformed next, this time to solve for coefficients α₀ and α₁ in thematrix α shown in FIG. 6). Note, that since the T matrix is now an N×2matrix, that matrix at needs to be a 2×M matrix, where the first row ispopulated with the α₀ coefficients (i.e., α_(0 1,1.) α_(0 1,2), . . .α_(0 1,M)). and the second row is populated with the α₁ coefficients(i.e., α_(1 1,1.) α_(1 1,2), . . . α_(1 1,M)). The α₀ coefficients thatwere calculated in the first iteration using only Model Zero arediscarded, so that new α₀ coefficients are solved for, along with α₁coefficients. These coefficients will provide a best fit for the use ofModel Zero and the first tent pole in predicting the Y-profiles. Aftersolving for the coefficients in matrix α, the prediction errors(residuals) are again calculated, using equation (3), where α is a 2×Mmatrix in this iteration, and T is an N×2 matrix. Each row of α may beconsidered a transform of the rows of Y. For linear regression, thistransformation is linear.

Again, GSMILES determines the row of the ε matrix which has the maximum absolute value of error, in a manner as described above. Whatever technique is used to determine the maximum absolute error, the row in which the maximum absolute error occurs is noted and used to identify the row (X-profile) from matrix 240, from which similarity values are again calculated. The calculated similarity values are used to populate the next column of values in the T matrix (in this iteration, the third column), which identifies the next tent pole in the model. The X-profile row selected for calculating the similarity values marks the location of the next (second, in this iteration) critical profile or “tent pole” identified by GSMILES for the model. A least squares regression algorithm is again performed, to perform the next iteration of the process, as described above. The GSMILES method can iterate through the above-described steps until the residuals come within the limits of the error range desired for the particular problem that is being solved, i.e., when the maximum error from matrix ε in any iteration falls below the error range. An example of an error threshold could be 0.01 or 0.1, or whatever other error level is reasonable for the problem being addressed. With each iteration, an additional tent pole is added to the model, thereby reducing the prediction error resulting in the overall model.

Alternatively, GSMILES may continue iterations as long as no two identified tent poles have locations that are too close to one another so as to be statistically indistinct from one another, i.e., significantly collinear. Put another way, GSMILES will not use two tent poles which are highly correlated and hence produce highly correlated similarity columns, i.e., which are collinear or nearly collinear (e.g., correlation squared (R²)>95%) for the two similarity columns produced by the two X-profiles (tent pole locations). However, even if an X-profile is dissimilar to (not near) all selected profiles in the model, it may still suffer collinearity problems with columns in the T-matrix as is. Hence, a tent-pole location is added to the model only if it passes both collinearity filters.

When a tent pole (row from matrix 240) is identified from the maximum absolute error in an ε matrix and is determined to be too close (nearly collinear) to a previously selected tent pole, GSMILES rejects this choice and moves to the next largest maximum absolute error value in that ε matrix. The row in matrix 240 which corresponds to the next largest maximum absolute error is then examined with regard to the previously selected tent poles, by referring to the similarity column created for each respective selected X-profile. If this new row describes a tent pole which is not collinear or nearly collinear with a previously selected tent pole, then the calculated similarity values are inserted into a new column in matrix T and GSMILES processes another iteration. On the other hand, if it is determined that this row is nearly collinear or collinear with a previously chosen tent pole, GSMILES goes back to the ε matrix to select the next highest absolute error value. GSMILES iterates through the error selection process until a tent pole is found which is not collinear or nearly collinear with a previously selected tent pole, or until GSMILES has exhausted all rows of the error matrix ε. When all rows of an error matrix ε have been exhausted, the model has its full set of tent poles and no more iterations of the above steps are processed for this model.
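
The collinearity screening described in the preceding two paragraphs may be sketched as follows; the R²>95% threshold comes from the text above, while the function shape and names are illustrative:

    import numpy as np

    def passes_collinearity_filters(s_new, T, r2_limit=0.95):
        # Reject a candidate similarity column whose squared correlation with
        # any existing column of T (Model Zero included) exceeds the limit.
        for k in range(T.shape[1]):
            col = T[:, k]
            if col.std() == 0.0 or s_new.std() == 0.0:
                continue                 # correlation with a constant is undefined
            if np.corrcoef(s_new, col)[0, 1] ** 2 > r2_limit:
                return False
        return True

In the iterative loop, a candidate that fails this check sends the search back to the ε matrix for the next largest error value, as described above.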

The last calculated α matrix (the α profile from the last iteration performed by GSMILES) contains the values that are used in the model for predicting the Y-profile from an X-profile input. Thus, once GSMILES determines the critical support profiles and the α values associated with them, the model can be used to predict the Y-profile for a new X-profile.

Referring now to FIG. 8, an example is shown wherein a new X-profile (referred to as X*) is inputted to GSMILES in order to predict a Y-profile for the same. For simplicity of explanation, this example uses only two tent poles, together with Model Zero, to characterize the GSMILES model. In practice, there will generally be many more tent poles employed. As a result, the α matrix in this example is a 3×M matrix, as shown in FIG. 8, and we have assumed, for example's sake, that the second tent pole is defined by the third row X-profile of the X-profile matrix 240. Therefore, the similarity values in column 3 of matrix T are populated by similarity values between row three of the X-profile matrix 240 and all rows in the X-profile matrix 240.

Again for simplicity, the example uses only a single X* profile, so that only a single row is added to the X-profile matrix 240, making it an (N+1)×n matrix, with the N+1^(st) row being populated with the X* profile values, although GSMILES is capable of handling multiple rows of X-profiles simultaneously, as would be readily apparent to those of ordinary skill in the art in view of the description of FIGS. 3-7 above.

Because the X-profile matrix has been expanded to N+1 rows, Model Zero in this case will also contain N+1 components (i.e., is an (N+1)×1 vector) as shown in FIG. 8. The tent pole similarity values for tent poles one and two (i.e., columns 2 and 3) of the T matrix are populated with the previously calculated similarity values for rows 1-N. Row N+1 of the second column is populated with a similarity value found by calculating the similarity between row 7 and row N+1 (i.e., the X* profile) of the new X-profile matrix 240. Similarly, row N+1 of the third column is populated with a similarity value found by calculating the similarity between row 3 and row N+1 (i.e., the X* profile) of the new X-profile matrix 240.

GSMILES then utilizes the α matrix to solve for the Y_(N+1) profile using the X_(N+1) profile (i.e., the X* profile) using the following equation:

T·α=Y+ε  (4)

where, for this example,

T=the N+1^(st) row of the T matrix shown in FIG. 8,

α=the α matrix shown in FIG. 8,

Y=the N+1^(st) row of the matrix 340 shown in FIG. 8,

ε=a vector of M error values associated with the Y-profile outcome.
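
Combining equation (4) with the FIG. 8 construction, predicting the Y-profile for a new profile X* reduces to assembling the N+1^(st) row of T and applying the fitted α matrix. A non-limiting sketch, assuming the constant-column Model Zero used in the earlier sketches, where similarity may be any of the measures defined previously and pole_rows holds the zero-based indices of the tent pole X-profiles (rows 7 and 3 of the example, in the text's one-based numbering):

    import numpy as np

    def predict_y_profile(x_star, X, pole_rows, alpha, similarity):
        # Assemble the new T-row: a Model Zero entry of 1, then one similarity
        # value between each tent pole X-profile and the new profile X*.
        t_row = np.concatenate((
            [1.0],
            [similarity(X[r], x_star) for r in pole_rows],
        ))
        return t_row @ alpha             # predicted Y-profile (length M)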

The error values will be within the acceptable range of permitted error designed into the GSMILES predictor according to the iterations performed in determining the tent poles as described above.

Typically, GSMILES overfits the data, i.e., noise is fit as systematic effects when in truth the noise tends to be random effects. The GSMILES model is trimmed back to the minimum of the sum of squared prospective ensemble errors to optimize prospective predictions, i.e., to remove tent poles that contribute to overfitting of the model to the data used to create the model, where even the noise associated with this data will tend to be modeled with too many tent poles.

Once the model is determined, the Z-columns of distribution-based U's are treated as linear score functions, where the associated distribution, such as the binomial logistic model, for example, assigns probability to each of the score values.

The initial such Y-score function is estimated by properties of the associated distribution, e.g., for a two-category logistic, assign the value +1 for one class and the value −1 for the other class. Another method uses a high-order polynomial in a conventional distribution analysis to provide the score vector. The high-order polynomial is useless for making any type of predictions, however. The GSMILES model according to the present invention predicts this score vector, thereby producing a model with high quality and effective prediction properties. The GSMILES model can be further optimized by using the critical S-columns of the similarity matrix directly in the distributional optimization, which could also include conventional X-variables and/or Model Zero. Hence, GSMILES provides a manageable set of high-leverage terms for distributional optimizations such as provided by generalized linear, mixed, logistic, ordinal, and survival model regression applications. In this fashion, GSMILES is not restricted to univariate binomial logistic distributions, because GSMILES can predict multiple columns of Y (in the Y-profile 340). Thus, GSMILES can simultaneously perform logistic regression, ordinal regression, survival regression, and other regression procedures involving multiple variable outcomes (multiple responses) as mediated by the score-function device. Some score functions produced by GSMILES do not require distributional models, but are useable as is. For example, continuous variables, such as temperature, can be analyzed by directly using the score function, without the need for logistic analysis. Other non-continuous variable outcomes may also not need logistic analysis, but may be used directly from a score function. For logistic regression, GSMILES assumes a binomial distribution pattern for scoring, while a multinomial distribution is assumed for ordinal regression and a Gaussian distribution is assumed for many other types of regression (continuous variables).
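
For the two-category logistic case just described, the score-function device amounts to training the model on +1/−1 class codes and passing the predicted score through the binomial logistic link; the ±1 coding comes from the text, while the helper names below are illustrative:

    import numpy as np

    def logistic_probability(score):
        # Binomial logistic link: maps a real-valued GSMILES score to P(class = +1).
        return 1.0 / (1.0 + np.exp(-np.asarray(score, dtype=float)))

    # Illustrative use: code the training outcomes as +1 / -1, fit the model,
    # then convert a predicted score for a new profile into a probability:
    # y_score = predict_y_profile(x_star, X, pole_rows, alpha, euclidean_similarity)
    # p_positive = logistic_probability(y_score)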

GSMILES can also fit disparate properties at the same time and provide score functions for them. For example, the Y columns may include distributional, text and continuous variables, all within the same matrix, which can be predicted by the model according to the present invention.

GSMILES can also perform predictions and similarity calculations on textual values. When text variables are included in the X-profile and/or the Y-profile, similarity calculations are performed among the rows of text, so that similarity values are also placed into the Y-profile, where the regression is performed with both predictor similarity values and predictee similarity values (i.e., similarity values are inserted on both sides of the equation, both in the X-profile as well as the Y-profile).

The GSMILES methodology can also be performed on a basis of dissimilarity, by forming a dissimilarity matrix according to the same techniques described above. Since dissimilarity, or distance, has an inverse relationship to similarity, one of ordinary skill in the art would readily be able to apply the techniques disclosed herein to form a GSMILES model based upon dissimilarity between the rows of the X-profile.

Leave-One-Out Cross-Validation

When modeling according to the GSMILES methodology, as with any type of prediction model, both fit error (training error) and validation error (test error) are encountered. In this case, fit error is the error that results in the ε matrix at the final iteration of determining the α matrix according to the above-described methodology, as GSMILES optimizes the training set (N×n matrix 240) to predict the training set Y-profile 340 (N×M matrix). Validation error is the error resulting from applying the model to an independent data set. For example, the validation error resulting in the example described above with regard to FIG. 8 is the E vector containing the M values of error associated with the (N+1)st row of the matrix 340 shown in FIG. 8.

In general, to determine test or validation error, the model determined with the training set is applied to an independent set of data (the test or validation set) which has known Y-outcome values. The model is applied to the X-profile of the test set to determine the Y-profile. The calculated Y-profile is then compared with the known Y-profile to calculate the test or validation error, and the test or validation error is then examined to determine whether it is within the preset, acceptable range of error permitted by the model. If the test or validation error is within the predefined limits of the error range, then the model passes the validation test. Otherwise, it may be determined that the model needs further revision, or that other factors prevent the model from being used with the test profile. For example, the test profile may contain some X-values that are outside the range of X-values on which the present model can effectively form predictions. Some of the X-variables may have little association with the Y-profiles, and hence they contribute non-productive variations, thereby reducing the efficiency of the GSMILES modeling process. Hence, more data would be required to randomize out the useless variations of such non-productive X-variables. Optionally, one can identify and eliminate such noisy X-variables, since they tend to have very low rank via the Marquardt-Levenberg (ML) ranking method described in this document. To identify a rank threshold between legitimate and noisy X-variables, an intentional noisy variable may be included in the X-profile and its ML rank noted. Repetition of this procedure with alternate versions of the noisy X-column, e.g., by random number generation, produces a distribution of such noise ranks, whose statistical properties may be used to set an X-noise threshold.
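
One way to realize this thresholding procedure is sketched below; `rank_variables` is a placeholder for the ML-based ranking routine (not an implementation of it), and the trial count and quantile are illustrative choices:

```python
import numpy as np

def noise_rank_threshold(X, rank_variables, n_trials=50, quantile=0.95):
    """Append a fresh random-noise column to the X-profile, record the
    importance score the ranking routine assigns to it, and repeat.
    The empirical distribution of these noise scores sets a threshold;
    legitimate X-variables should score above it."""
    rng = np.random.default_rng(0)
    noise_scores = []
    for _ in range(n_trials):
        noise = rng.standard_normal((X.shape[0], 1))
        scores = rank_variables(np.hstack([X, noise]))
        noise_scores.append(scores[-1])      # score of the injected column
    return np.quantile(noise_scores, quantile)
```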

The leave-one-out cross-validation technique involves estimating the validation error through use of the training set. As an example, assuming that matrix 240,340 in FIG. 3 is the initial training set, the leave-one-out technique involves extracting one of the rows of the training set prior to carrying out the GSMILES methodology to solve for the similarity and α matrices described above. So, in this case, the "altered" training set will include an X-profile which is an (N−1)×n matrix and a Y-profile which is an (N−1)×M matrix. The extracted row (for a non-limiting example, we can assume that row 5 was extracted) becomes the validation set that will be used after solving for the GSMILES model.

Using the altered training data set, an α matrix is solved for using the techniques described above with regard to the GSMILES least squares methodology. After determining the α matrix, this α matrix is then used to predict the outcome for the extracted row (i.e., the test set, row 5 in the current example). Because the Y-profile of the test set is known, the known Y-values can be compared with the predicted Y-values to determine the validation error and to determine whether this validation error is within the acceptable range of error.

The same procedure may be carried out for each row of the original training data set 240,340, one row at a time. In this way, each profile used in the training data set can be used independently as a validation data set. By summing the squares of the errors derived from each extracted row and dividing by the number of rows, a variance can be determined for the validation error (i.e., validation variance). However, to require validation error to be determined by completely processing through the GSMILES methodology to independently determine an α matrix for each extracted row is to require a great deal of processing time, particularly for typical data sets which may contain thousands of rows. This is both time consuming and expensive, and therefore inefficient.
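
The brute-force procedure just described looks like the following sketch, with ordinary linear least squares standing in for the full GSMILES fit; the point is only that the model must be refit once per row:

```python
import numpy as np

def loo_variance_naive(X, y):
    """Leave-one-out validation variance by refitting the model once
    per held-out row: n full refits for n rows."""
    n = X.shape[0]
    sq_errors = []
    for i in range(n):
        keep = np.arange(n) != i
        alpha, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        sq_errors.append((y[i] - X[i] @ alpha) ** 2)
    return np.mean(sq_errors)
```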

For simplicity and clarity, standard notation is used in the following discussion, wherein a single variable denoted y is a function of a vector of variables denoted by x. Note that this x actually represents the T-rows in the GSMILES formalism referred to above. Without loss of generality, consider a single y-variable as a function of multiple x-variables. A generalized solution is derived for the Leave-One-Out (LOO) cross-validation statistic for a model f(x; α) trained on a data set D = {(x₁, y₁), . . . , (x_n, y_n)}, x_i ∈ ℝ^m, y_i ∈ ℝ, where removing a single data point (x_i, y_i) results in a training set D_i and a predictor f_i(x; α). The difference between the observation y_i and what a model predicts in the absence of (x_i, y_i) is ε_i = y_i − f_i(x_i, α). The Leave-One-Out (LOO) cross-validation statistic estimates the variance of this error:

$\sigma_{LOO}^{2} = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2} \qquad (5)$

Rather than evaluating LOO by retraining the model n times, a formulation which relates σ²_LOO to the quantities already used in training f(x; α) is needed, in order to avoid the inefficiencies and expense of completely processing through the GSMILES methodology to independently determine an α vector for each extracted row, as alluded to above. This is possible for linear models f(x; α) = α^T x, α ∈ ℝ^m. If the data matrix and response vector are defined as:

$X = \begin{pmatrix} x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{n}^{T} \end{pmatrix}, \qquad y = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix} \qquad (6)$

then the linear least squares solution α and corresponding residual ρ are:

$\alpha = (X^{T}X)^{-1}X^{T}y \qquad (7)$

$\rho = y - X\alpha \qquad (8)$

$\;\; = y - X(X^{T}X)^{-1}X^{T}y \qquad (9)$

$\;\; = \left(I - X(X^{T}X)^{-1}X^{T}\right)y \qquad (10)$

$\;\; = Py \qquad (11)$

where P ≡ I − X(X^T X)^{-1} X^T is the n×n projection matrix. If the first data point is partitioned from the data matrix, the abbreviated training set defines a matrix X̄ and response vector ȳ related to the original as follows:

$X = \begin{pmatrix} x_{1}^{T} \\ \bar{X} \end{pmatrix}, \qquad y = \begin{pmatrix} y_{1} \\ \bar{y} \end{pmatrix} \qquad (12)$

$X^{T}X = \bar{X}^{T}\bar{X} + x_{1}x_{1}^{T} \qquad (13)$

$X^{T}y = \bar{X}^{T}\bar{y} + y_{1}x_{1} \qquad (14)$

The least squares solution of the truncated data set is:

$\bar{\alpha} = (\bar{X}^{T}\bar{X})^{-1}\bar{X}^{T}\bar{y} \qquad (15)$

The prediction error resulting from the removal of the first row is therefore:

$\varepsilon_{1} = y_{1} - \bar{\alpha}^{T}x_{1} \qquad (16)$

The relationships defined in equations (12), (13) and (14) are next used to replace X̄, ȳ and ᾱ. First, the Sherman-Morrison-Woodbury formula establishes that:

$(\bar{X}^{T}\bar{X})^{-1} = (X^{T}X - x_{1}x_{1}^{T})^{-1} = (X^{T}X)^{-1} + \frac{(X^{T}X)^{-1}x_{1}x_{1}^{T}(X^{T}X)^{-1}}{1 - x_{1}^{T}(X^{T}X)^{-1}x_{1}} \qquad (17)$
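
Equation (17) is easy to confirm numerically; a minimal check with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 4))
x1, Xbar = X[0], X[1:]               # partition off the first row

F = np.linalg.inv(X.T @ X)
lhs = np.linalg.inv(Xbar.T @ Xbar)
rhs = F + (F @ np.outer(x1, x1) @ F) / (1.0 - x1 @ F @ x1)
assert np.allclose(lhs, rhs)         # Sherman-Morrison-Woodbury, eq. (17)
```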

For the sake of abbreviation, define F = (X^T X)^{-1}, d₁ = x₁^T F x₁, and μ₁ = 1 − d₁. Note that μ₁ and d₁ are scalars. Substituting these relationships into equation (15) gives:

$\bar{\alpha} = \left[F + \tfrac{1}{\mu_{1}}Fx_{1}x_{1}^{T}F\right](X^{T}y - y_{1}x_{1}) \qquad (18)$

$\;\; = \tfrac{1}{\mu_{1}}\left[\mu_{1}F + Fx_{1}x_{1}^{T}F\right](X^{T}y - y_{1}x_{1}) \qquad (19)$

$\;\; = \tfrac{1}{\mu_{1}}\left[\mu_{1}F(X^{T}y - y_{1}x_{1}) + Fx_{1}x_{1}^{T}F(X^{T}y - y_{1}x_{1})\right] \qquad (20)$

$\;\; = \tfrac{1}{\mu_{1}}\left[\mu_{1}FX^{T}y - \mu_{1}y_{1}Fx_{1} + Fx_{1}x_{1}^{T}FX^{T}y - y_{1}d_{1}Fx_{1}\right] \qquad (21)$

Returning to the prediction error of equation (16) and substituting the above-developed relationships gives:

$\varepsilon_{1} = y_{1} - \bar{\alpha}^{T}x_{1} \qquad (16)$

$\;\; = y_{1} - x_{1}^{T}\bar{\alpha} \qquad (22)$

$\;\; = \tfrac{1}{\mu_{1}}\left(\mu_{1}y_{1} - x_{1}^{T}(\mu_{1}\bar{\alpha})\right) \qquad (23)$

$\;\; = \tfrac{1}{\mu_{1}}\left[\mu_{1}y_{1} - \mu_{1}x_{1}^{T}FX^{T}y + \mu_{1}y_{1}x_{1}^{T}Fx_{1} - x_{1}^{T}Fx_{1}x_{1}^{T}FX^{T}y + y_{1}d_{1}x_{1}^{T}Fx_{1}\right] \qquad (24)$

$\;\; = \tfrac{1}{\mu_{1}}\left[\mu_{1}y_{1} - \mu_{1}x_{1}^{T}FX^{T}y + \mu_{1}y_{1}d_{1} - d_{1}x_{1}^{T}FX^{T}y + y_{1}d_{1}^{2}\right] \qquad (25)$

$\;\; = \tfrac{1}{\mu_{1}}\left[\mu_{1}y_{1}(1 + d_{1}) - (\mu_{1} + d_{1})x_{1}^{T}FX^{T}y + y_{1}d_{1}^{2}\right] \qquad (26)$

$\;\; = \tfrac{1}{\mu_{1}}\left[(1 - d_{1})y_{1}(1 + d_{1}) + y_{1}d_{1}^{2} - x_{1}^{T}FX^{T}y\right] \qquad (27)$

$\;\; = \tfrac{1}{\mu_{1}}\left[y_{1}(1 - d_{1}^{2}) + y_{1}d_{1}^{2} - x_{1}^{T}FX^{T}y\right] \qquad (28)$

$\;\; = \tfrac{1}{\mu_{1}}\left[y_{1} - x_{1}^{T}FX^{T}y\right] \qquad (29)$

$\;\; = \frac{y_{1} - x_{1}^{T}(X^{T}X)^{-1}X^{T}y}{1 - x_{1}^{T}(X^{T}X)^{-1}x_{1}} \qquad (30)$

By noting that y₁ = e₁^T y and x₁^T = e₁^T X, where e₁ = [1 0 0 . . . 0]^T, this gives:

$\varepsilon_{1} = \frac{e_{1}^{T}y - e_{1}^{T}X(X^{T}X)^{-1}X^{T}y}{1 - e_{1}^{T}X(X^{T}X)^{-1}X^{T}e_{1}} \qquad (31)$

$\;\; = \frac{e_{1}^{T}\left(I - X(X^{T}X)^{-1}X^{T}\right)y}{e_{1}^{T}\left(I - X(X^{T}X)^{-1}X^{T}\right)e_{1}} \qquad (32)$

$\;\; = \frac{e_{1}^{T}Py}{e_{1}^{T}Pe_{1}} \qquad (33)$

$\;\; = \frac{e_{1}^{T}\rho}{e_{1}^{T}Pe_{1}} \qquad (34)$

$\;\; = \frac{\rho_{1}}{e_{1}^{T}Pe_{1}} \qquad (35)$

$\;\; = \frac{\rho_{1}}{P_{11}} \qquad (36)$

From this it can be observed that the prediction error resulting from the removal of the first data point is the ratio of the first element of the residual and the first diagonal element of the projection matrix. Since any data point (x_i, y_i) can be permuted to the first row without changing the solution, the conclusion is reached, without any loss of generality, that:

$\varepsilon_{i} = \frac{\rho_{i}}{P_{ii}} \qquad (37)$

and

$\sigma_{LOO}^{2} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\rho_{i}}{P_{ii}}\right)^{2} \qquad (38)$
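
The identity of equations (37) and (38) lets σ²_LOO be obtained from a single fit. A quick numerical sketch (ordinary least squares again standing in for the full GSMILES fit; the data are illustrative) that can be checked against the brute-force `loo_variance_naive` shown earlier:

```python
import numpy as np

def loo_variance_fast(X, y):
    """sigma^2_LOO from one fit: each LOO error is the residual
    divided by the matching diagonal element of the projection
    matrix P = I - X (X^T X)^{-1} X^T, per equations (37)-(38)."""
    P = np.eye(len(y)) - X @ np.linalg.solve(X.T @ X, X.T)
    rho = P @ y                               # residual, equation (11)
    return np.mean((rho / np.diag(P)) ** 2)   # equation (38)

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(40)
print(loo_variance_fast(X, y))   # matches loo_variance_naive(X, y)
```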

In order to compute σ²_LOO in the context of sequential least-squares processing such as is used in the GSMILES methodology (it is later a useful metric for trimming to the optimal subset of basis vectors, i.e., tent poles), note that in each iteration k+1 of the algorithm, a column a_{k+1} is added to the data matrix X_k (e.g., such as data matrix 240). This gives the general formula:

$X_{k+1} = [X_{k}\;\; a_{k+1}] \qquad (39)$

When n is large, forming the projection matrix P in order to extract its diagonal elements is impractical, requiring n×n memory, which could exceed the limits of current hardware. It is also computationally expensive, making it infeasible to re-compute at every iteration k. Instead, the QR factorization of X_k is computed at every iteration, where:

$X_{k} = Q_{k}R_{k} = Q_{k}\begin{pmatrix} \bar{R}_{k} \\ 0 \end{pmatrix} \qquad (40)$

where X_k ∈ ℝ^{n×k}, Q_k ∈ ℝ^{n×n}, R_k ∈ ℝ^{n×k}, and R̄_k ∈ ℝ^{k×k}. R̄_k is upper triangular and, by design, non-singular; Q_k is orthogonal. Q_k^T is a product of Householder matrices, as follows:

$Q_{k}^{T} = H_{k}H_{k-1}\cdots H_{1} \qquad (41)$

Each Householder matrix depends only on ν_k ∈ ℝ^n, the Householder vector:

$H_{k} = I - T_{k}\nu_{k}\nu_{k}^{T} \qquad (42)$

where T_k = 2/(ν_k^T ν_k). An efficient implementation of the algorithm will not store Q_k or any of its factors explicitly; only the product of Q_k with some n-vector g, i.e., Q_k^T g or Q_k g, is needed. For this purpose, storing the set of Householder vectors {ν₁, ν₂, . . . , ν_k} is sufficient. By design, ν_k has the following special structure: ν_k^T = [0 . . . 0 1 B . . . B], where the 0 elements extend over k−1 columns and the B elements extend over n−k columns. A recursive relationship for the projection matrix P at the k-th iteration, P_k, can now be shown:

$P_{k} = I_{n} - X_{k}(X_{k}^{T}X_{k})^{-1}X_{k}^{T} \qquad (43)$

$\;\; = I_{n} - (Q_{k}R_{k})(R_{k}^{T}Q_{k}^{T}Q_{k}R_{k})^{-1}(R_{k}^{T}Q_{k}^{T}) \qquad (44)$

$\;\; = I_{n} - Q_{k}R_{k}(R_{k}^{T}R_{k})^{-1}R_{k}^{T}Q_{k}^{T} \qquad (45)$

$\;\; = I_{n} - Q_{k}\begin{pmatrix} \bar{R}_{k} \\ 0 \end{pmatrix}\left(\left[\bar{R}_{k}^{T}\;\;0\right]\begin{pmatrix} \bar{R}_{k} \\ 0 \end{pmatrix}\right)^{-1}\left[\bar{R}_{k}^{T}\;\;0\right]Q_{k}^{T} \qquad (46)$

$\;\; = I_{n} - Q_{k}\begin{pmatrix} \bar{R}_{k} \\ 0 \end{pmatrix}(\bar{R}_{k}^{T}\bar{R}_{k})^{-1}\left[\bar{R}_{k}^{T}\;\;0\right]Q_{k}^{T} \qquad (47)$

$\;\; = I_{n} - Q_{k}\begin{pmatrix} \bar{R}_{k}(\bar{R}_{k}^{T}\bar{R}_{k})^{-1}\bar{R}_{k}^{T} & 0 \\ 0 & 0 \end{pmatrix}Q_{k}^{T} \qquad (48)$

$\;\; = I_{n} - Q_{k}\begin{pmatrix} \bar{R}_{k}\bar{R}_{k}^{-1}\bar{R}_{k}^{-T}\bar{R}_{k}^{T} & 0 \\ 0 & 0 \end{pmatrix}Q_{k}^{T} \qquad (49)$

$\;\; = I_{n} - Q_{k}\begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix}Q_{k}^{T} \qquad (50)$

$\;\; = I_{n} - H_{1}\cdots H_{k-1}H_{k}\begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix}H_{k}H_{k-1}\cdots H_{1} \qquad (51)$

Furthermore,

$H_{k}\begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix}H_{k} = (I_{n} - T_{k}\nu_{k}\nu_{k}^{T})\begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix}(I_{n} - T_{k}\nu_{k}\nu_{k}^{T}) \qquad (52)$

$\;\; = \begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix} - T_{k}\nu_{k}\nu_{k}^{T}\begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix} - T_{k}\begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix}\nu_{k}\nu_{k}^{T} + T_{k}^{2}\nu_{k}\nu_{k}^{T}\begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix}\nu_{k}\nu_{k}^{T} \qquad (53)$

As a result of the special structure of ν_k,

$\begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix}\nu_{k} = e_{k}, \quad\text{and} \qquad (54)$

$e_{k}^{T}\nu_{k} = 1 \qquad (55)$

and thus,

$H_{k}\begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix}H_{k} = \begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix} - T_{k}\nu_{k}e_{k}^{T} - T_{k}e_{k}\nu_{k}^{T} + T_{k}^{2}\nu_{k}e_{k}^{T}\nu_{k}\nu_{k}^{T} \qquad (56)$

$\;\; = \begin{pmatrix} I_{k} & 0 \\ 0 & 0 \end{pmatrix} - T_{k}\nu_{k}e_{k}^{T} - T_{k}e_{k}\nu_{k}^{T} + T_{k}^{2}\nu_{k}\nu_{k}^{T} \qquad (57)$

$\;\; = \begin{pmatrix} I_{k-1} & 0 \\ 0 & 0 \end{pmatrix} + e_{k}e_{k}^{T} - T_{k}\nu_{k}e_{k}^{T} - T_{k}e_{k}\nu_{k}^{T} + T_{k}^{2}\nu_{k}\nu_{k}^{T} \qquad (58)$

$\;\; = \begin{pmatrix} I_{k-1} & 0 \\ 0 & 0 \end{pmatrix} + (e_{k} - T_{k}\nu_{k})(e_{k} - T_{k}\nu_{k})^{T} \qquad (59)$

$\;\; = \begin{pmatrix} I_{k-1} & 0 \\ 0 & 0 \end{pmatrix} + z_{k}z_{k}^{T} \qquad (60)$

where z_k ≡ e_k − T_k ν_k. Returning to P_k, we now have:

$P_{k} = I_{n} - H_{1}\cdots H_{k-1}\left(\begin{pmatrix} I_{k-1} & 0 \\ 0 & 0 \end{pmatrix} + z_{k}z_{k}^{T}\right)H_{k-1}\cdots H_{1} \qquad (61)$

$\;\; = I_{n} - H_{1}\cdots H_{k-1}\begin{pmatrix} I_{k-1} & 0 \\ 0 & 0 \end{pmatrix}H_{k-1}\cdots H_{1} - H_{1}\cdots H_{k-1}z_{k}z_{k}^{T}H_{k-1}\cdots H_{1} \qquad (62)$

$\;\; = P_{k-1} - Q_{k-1}z_{k}z_{k}^{T}Q_{k-1}^{T} \qquad (63)$

$\;\; = P_{k-1} - w_{k}w_{k}^{T} \qquad (64)$

where w_k ≡ Q_{k−1} z_k. Finally, the i-th diagonal element of the projection matrix is

$(P_{k})_{ii} = e_{i}^{T}(P_{k-1} - w_{k}w_{k}^{T})e_{i} \qquad (65)$

$\;\; = (P_{k-1})_{ii} - e_{i}^{T}w_{k}w_{k}^{T}e_{i} \qquad (66)$

$\;\; = (P_{k-1})_{ii} - (w_{k})_{i}^{2} \qquad (67)$

where

$T_{k} = \frac{2}{\nu_{k}^{T}\nu_{k}} \qquad (68)$

$z_{k} = e_{k} - T_{k}\nu_{k} \qquad (69)$

$w_{k} = Q_{k-1}z_{k} \qquad (70)$

and

$P_{0} = I_{n} \qquad (71)$

Hence, one has an LOO sum of squared residuals for every y-column in matrix Y. Optionally, using an ensemble error for each row produces an ensemble LOO sum of squared residuals, as is used by GSMILES.

Referring now to FIG. 9, a flow chart 900 identifies some of the important process steps in one example of an iterative procedure employed by GSMILES in determining a predictor model. At step 902, GSMILES module 140 receives inputted data which has been preprocessed according to one or more of the techniques described above. Each profile of associated measurements of variables of the inputted data is treated as an object by GSMILES at step 904, with potentially three classes of information: predictor/driver variables (X-variables), predictee/consequential variables (Y-variables), and nuisance variables (noise variables, known and unknown). Note that these classes are not mutually exclusive; hence, a variable can belong to one or more of these GSMILES classes as dictated by the particular analysis being processed.

GSMILES calculates similarity among all objects at step 906, according to the techniques described above. Note that similarity may be compound, e.g., a combination of similarity measures, where each similarity component is specific to a subset of X-profile variables. Note further that GSMILES may just as well calculate dissimilarity among all objects to arrive at the same results; for the sake of simplicity, only the similarity calculation method is described here, as an example. It would be readily apparent to those of ordinary skill in the statistical arts how to proceed on a basis using dissimilarity. GSMILES uses the similarity values to predict the Y-variables, as described above. However, GSMILES is not limited to predicting Y-variables, but may also be used to predict the X-variables themselves, via the similarity matrix, an operation that functions as a noise filter, or smoothing function, to arrive at a more stable set of X-variables. GSMILES may also be used to solve for X-variables and Y-variables simultaneously. When text variables are involved, these variables may appear in one or both of the X- and Y-profiles. GSMILES calculates similarity among the text variables, and provides similarity values for these text values with regard to the X-profile, as well as the Y-profile when text is present in the Y-profile. Hence, the set of text Y-variables is replaced by a similarity column to form the new Y-matrix, the Y2-matrix.
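
By way of illustration only (the similarity measures actually used are those described earlier in this document), a Gaussian radial form is one plausible instance of an object-by-object similarity calculation:

```python
import numpy as np

def similarity_matrix(X, decay=1.0):
    """Pairwise similarity among objects (rows of the X-profile),
    modeled here as exp(-decay * squared Euclidean distance); the
    corresponding dissimilarity is the inversely related distance."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-decay * sq)
```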

Using the similarity values, GSMILES selects a critical subset of objects (identifying the locations of the tent poles) at step 908 that can optimally predict the Y-values (or other values being solved for) of all objects within the precision limitations imposed by nuisance effects, as assured by statistically valid criteria. The selection may be made by an iterative algorithm as was discussed above, and which is further referred to below.

Upon identification of the tent pole locations and similarity values representing the tent poles, as well as an estimation of the X-nonlinear transformation ("α values") of the Y-profiles associated with the strategic X-profiles (tent poles) by least squares regression or another optimization technique, GSMILES maximizes the number of tent poles at step 910 to minimize the sum of squared prospective errors between the X- and Y-profiles. At step 912, GSMILES then trims back the number of tent poles (by "trimming," as described above), where the GSMILES model is trimmed back to the minimum of the prospective sum of squares to optimize prospective predictions, i.e., to remove tent poles that contribute to overfitting of the model to the data used to create the model, since with too many tent poles even the noise associated with this data will tend to be modeled. Trimming may be carried out with the aid of Leave-One-Out cross-validation techniques, as described above, or by other techniques designed to compare training error (fit error) with validation error (test error) to optimize the model.

FIGS. 11 and 12 illustrate an example of such a comparison. FIG. 11 plots 1100 the maximum absolute (ensemble) error versus the number of tent poles used in developing the model (training or fit error versus the number of tent poles). It can be observed in FIG. 11 that the error asymptotically approaches a perfect fit as the number of poles is increased. FIG. 12 graphs 1200 the square root of the sum of the squared LOO errors divided by the number of terms squared, plotted against the number of tent poles, as a measure of test or validation error (described above). It can be seen from FIG. 12 that somewhere in the range of 60-70 tent poles, the error terms stop decreasing and begin to rapidly increase. By comparing the two charts of FIGS. 11 and 12, GSMILES makes the determination to trim the number of poles to the number that correlates to the location on the chart of FIG. 12 where the error starts to diverge (somewhere in the range of 60-70 in FIG. 12, although GSMILES would be able to accurately identify the number where the minimum occurs, which is the point where divergence begins). The poles beyond this number are those that contribute to fitting the noise or nuisance variables in the chart of FIG. 11.
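
The trimming decision itself reduces to locating the minimum of the validation-error curve; a sketch with an illustrative, made-up curve:

```python
import numpy as np

# Illustrative LOO validation-error curve indexed by tent-pole count:
# error falls, bottoms out, then rises as poles begin fitting noise.
loo_curve = np.array([5.1, 3.2, 2.4, 2.0, 1.9, 2.3, 3.0, 4.8])
n_poles = int(np.argmin(loo_curve)) + 1   # trim back to the minimum
print(n_poles)                            # -> 5
```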

After optimization of the model, the model is ready to be used in calculating predictions at step 914. Upon calculating prediction values, the present invention may optionally employ a scoring method. Score functions are optimized for every outcome in the modeling process. For example, multivariate probabilities of survival and/or categorical outcomes can be optimally assigned to the GSMILES scores. If appropriate, the distributional property of each outcome is then used to optimally assign a probability function to its score function. The modeled score/probability functions may be used to find regions of profiles that satisfy all criteria/specifications placed upon the multiple outcomes. The profile components can be ranked according to their importance to the derived multi-functionality.

FIG. 10 is a flow chart 1000 representing some of the important process steps in one example of an iterative algorithm that GSMILES employs to select the columns of a similarity matrix, such as similarity matrix T described above. To solve for the critical profiles, an initial model (i.e., Model Zero) is inputted to the system at step 1002, in matrix T, as described above with regard to FIG. 5. A least squares regression is next performed at step 1004 to solve for the α coefficients (in this iteration, the α₀ coefficients) which provide a best fit for the use of the model (which includes only Model Zero in this iteration) to predict the Y-profiles (or X-profiles, or X- and Y-profiles, or whatever the output variables have been defined as, as discussed above).

Next, the residuals (prediction errors ε) are calculated at step 1006, as described in detail above with regard to FIGS. 5-6. The residual values are then analyzed by GSMILES to determine the absolute error value that meets a predefined selection criterion. As described above, one example of a predefined selection criterion is maximum absolute error, which may be simply selected from the residuals when the residual is a vector. However, when the residuals take the form of a matrix, as in FIG. 6, an ensemble error is calculated for each row of the matrix by GSMILES, where the ensemble error is defined to leverage communalities. The ensemble errors are then used in selecting according to the selection criterion. Examples of ensemble error calculations are described above. Although the above examples use maximum absolute error as the selection criterion, other criteria may alternatively be used. Examples of alternative criteria are mean (ensemble) absolute error, median (ensemble) absolute error, mode (ensemble) absolute error, weighted average (ensemble) absolute error, robust average (ensemble) absolute error, or another predefined error measure. The residual error value (or ensemble residual error value) meeting the selection criterion is identified at step 1008.

GSMILES then selects the X-profile row from the input matrix (e.g., matrix 240) that corresponds to the row of the residual matrix from which the residual error (or ensemble error) was selected. This identifies a potential location of a tent pole to be used in the model. At step 1012, GSMILES then calculates similarity (or dissimilarity) values between the selected X-profile row and each row of the input matrix (including the selected row) and uses these similarity values to populate the next column of the similarity matrix T, assuming that the selected X-profile row is not too close in its values (e.g., collinear or nearly collinear) to another X-profile row that has already been previously selected, as determined in step 1014.

If it is determined that the values are not collinear or nearly collinear with a previously selected tent pole profile, then the similarity values calculated in step 1012 are inputted to the next column of similarity matrix T at step 1016. The process then returns to step 1004 to perform another least squares regression using the new similarity matrix. If the selected row is determined to be collinear or nearly collinear with Model Zero and all other columns of matrix T (from previously selected X-profile rows), via step 1014, GSMILES rejects the currently selected X-profile row and does not use it for a tent pole (of course, it would not determine this in the first iteration if Model Zero were selected as a null set, since there would be no previously selected rows). GSMILES then determines whether there are any remaining rows of the X-profile which have not already been selected and considered, at step 1018. If all rows have not yet been considered, then GSMILES goes back to the residual error values and selects the error (or ensemble error) value that is next closest to the selection criterion, at step 1020. For example, if the selection criterion is maximum absolute value, GSMILES would select the row of the residual values that has the second highest absolute error at this stage of the cycle.

Processing then returns to step 1012 to calculate similarity values for the newly selected row. This subroutine is repeated until a new tent pole is selected which is not collinear or nearly collinear with Model Zero and all previous T-columns, or until it is determined at step 1018 that all rows have been considered. When all rows have been considered, the similarity matrix has been completed, and no more tent poles are added.

An optional stopping method is shown in step 1009, where, after the step of determining the absolute error or ensemble error value that meets the selection criterion in step 1008, GSMILES determines whether the selected absolute error value is less than or equal to a predefined error threshold for the current model. If the selected error value is less than or equal to the predefined error threshold, then GSMILES determines that the similarity matrix has been completed, and no more tent poles are added. If the selected error value is greater than the predefined error threshold, then processing continues to step 1010. Note that step 1009 can be used in conjunction with steps 1014, 1018 and 1020, or as an alternative to these steps.
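
The selection loop of flow chart 1000 can be sketched compactly; maximum absolute ensemble error is used as the selection criterion, a projection-residual test stands in for the collinearity check of step 1014, and the tolerances and helper names are illustrative:

```python
import numpy as np

def select_tent_poles(S, Y, err_tol=1e-3, lin_tol=1e-3, max_poles=50):
    """Greedy tent-pole selection. S is the N x N object similarity
    matrix, Y the N x M Y-profile. T starts as Model Zero (here a
    constant column) and grows one similarity column per accepted pole."""
    n = S.shape[0]
    T = np.ones((n, 1))                                  # Model Zero
    poles = []
    while len(poles) < max_poles:
        alpha, *_ = np.linalg.lstsq(T, Y, rcond=None)    # step 1004
        ensemble = np.abs(Y - T @ alpha).max(axis=1)     # steps 1006-1008
        if ensemble.max() <= err_tol:                    # optional step 1009
            break
        added = False
        for i in np.argsort(-ensemble):                  # steps 1008, 1020
            if i in poles:
                continue
            col = S[:, i]                                # step 1012
            # Collinearity check (step 1014): how much of col lies
            # outside the span of the current T-columns?
            proj = T @ np.linalg.lstsq(T, col, rcond=None)[0]
            if np.linalg.norm(col - proj) > lin_tol * np.linalg.norm(col):
                T = np.column_stack([T, col])            # step 1016
                poles.append(i)
                added = True
                break
        if not added:                                    # step 1018: exhausted
            break
    return poles, T
```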

As alluded to above, the GSMILES predictor model can be used to fit a matrix to a matrix, e.g., to fit a matrix of X-profiles to itself, inherently using eigenvalue analysis and partial least squares processing. Thus, the X-profile values may be used to fit themselves through a one-dimensional linear transformation, i.e., a bottleneck, based on the largest singular-value eigenvalue of that matrix. Using the techniques described above, the same procedure is used to develop a similarity matrix, only the X-profile matrix replaces the Y-profile matrix referred to above. This technique is useful, for example, in situations where some of the X-values are missing in the X-profile (missing data). In these situations, a row of X-profile data may contain known, useful values that the researcher does not necessarily want to throw out just because all values of that row are not present. In such an instance, imputation data may be employed, where GSMILES (or the user) puts in some estimates of what the missing values are. Then GSMILES can use the completed X-profile matrix to predict itself. This produces predictions for the missing values which are different from the estimates that were put in. The predictions are better because they are more consistent with all the values in the matrix, since all of the other values in the matrix were used to determine the missing-value predictions. Initial estimates of the missing values may be average X-values, or some other starting values which are reasonable for the particular application being studied. When the predictions are outputted from GSMILES, they can then be plugged into the missing data locations, and the process may be repeated to get more refined predictions. Iterations may be performed until differences between the current replacement values and those of the previous iteration are less than a pre-defined threshold value of correction difference.
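
A sketch of this iterative imputation loop: `fit_and_predict` is a placeholder for a routine that returns the model's reconstruction of X from itself (a GSMILES self-fit in the text above), and the convergence tolerance is illustrative:

```python
import numpy as np

def impute_iteratively(X, fit_and_predict, tol=1e-6, max_iter=100):
    """Fill missing (NaN) entries with column means, then repeatedly
    replace them with the model's own predictions until successive
    replacements differ by less than tol."""
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    for _ in range(max_iter):
        X_hat = fit_and_predict(X)            # model predicts X from X
        delta = np.abs(X_hat[missing] - X[missing]).max()
        X[missing] = X_hat[missing]           # refine only the missing cells
        if delta < tol:
            break
    return X
```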

Another use for this type of processing is as an effective noise filter for the X-profile, wherein cycling the X-profile data through GSMILES as described above (whether there is missing data or not) effectively smoothes the X-profile function, reducing noise levels and acting as a filter. This results in a "cleaner" X-profile.

Still further, GSMILES may be used to predict both X- and Y-profiles simultaneously, using the X-profile also to produce tent poles. This again is related to eigenvalue analysis and partial least squares processing, and to dimensional reduction or bottlenecking transformations. Note that GSMILES inherently produces a nonlinear analogy of partial least squares. However, partial least squares processing may possibly incorrectly match information (eigenvalues) of the X- and Y-matrices. To prevent this possibility, GSMILES may optionally use the X-profile matrix to simultaneously predict both X- and Y-values in the form of a combined matrix, either stacked vertically or concatenated horizontally. If the relative weight of each matrix within the combination is about equal, then one achieves correct matching of the eigenvalues. The nonlinear version of this method is accomplished by using the X-profile to predict both the X- and Y-profiles using GSMILES.

Still further, it is possible to simultaneously remove noise, impute missing X-values, and analyze causal relationships between the rows (profiles) of the concatenated version X/Y of the two matrices (X- and Y-profiles), by using GSMILES to model X/Y as both input and output. Optionally, to enhance causal leverage, GSMILES is not allowed to use Y-profiles in the input X/Y for tent-pole selection. Hence, strategic profiles may be found in the X-profile part of the X/Y input matrix to optimally predict all profiles in X stacked on Y, symbolized by X/Y. GSMILES can then cluster the resulting profiles in the prediction-enhanced X/Y matrix. This is a form of synchronization that tends to put associated heterogeneous profiles, such as phenotypic properties versus gene-expression properties, for example, into the same cluster. This method is useful to identify gene expression profiles and compound activity profiles that tend to synchronize or anti-synchronize together, suggesting some kind of interaction between the genes and compounds in each cluster.

The importance of each X-variable is determined by the Marquardt-Levenberg (ML) method applied to the GSMILES model. Hence, this process is leveraged by all Y-variables and their internal relationships, such as communalities induced by common phenomena, which common phenomena are often unknown. GSMILES may multiply a coefficient onto each variable to express the ellipticity of the basis set as a function of the X-space. Typically, these coefficients are assumed to be constant with a value of unity, i.e., signifying global radial symmetry over the X-space. The Marquardt-Levenberg algorithm can be used to test this assumption. A byproduct of the use of the Marquardt-Levenberg algorithm in this manner is the model leverage associated with each coefficient and hence, each variable. This leverage may be used to rank the X-variables.

The GSMILES nodes (tent poles) are localized basis functions based on similarity between locations in the model domain (X-space). The span of influence of each basis function is determined by that function's particular decay constants. The bigger a constant is, the faster the decay, and hence the smaller the influence region of the node surrounding its domain location. The best decay value depends on the density of data adjacent to the node location, the clustering properties of the data, and the functional complexity of the Y-ensemble there. For example, if the Y-ensemble is essentially constant in the domain region containing the node location, then all adjacent data are essentially replicates. Hence, the node function should essentially average these adjacent Y-values. However, beyond such adjacent data, the node influence should decay appropriately to maintain its localized status. If decay is too fast, then the basis function begins to act like a delta function or dummy spike variable and cannot represent possible systematic regional trends. If decay is too slow, the basis function begins to act like a constant. The same concept applies to data clusters in place of individual data points. In that respect, note that individual data points may be considered as clusters with a membership of one element.

To determine appropriate decay constants for each domain location in the data, GSMILES determines the working dimension of the domain at each data location, and then computes a domain simplex of data adjacent to each such location. The decay constant for each location is set to the inverse of the largest of the dissimilarity values between each location and the simplex of adjacent data. This normalizes the dissimilarity function for each node according to the data density at the node. In this case, the normalized dissimilarity becomes unity at the most dissimilar location within the simplex of adjacent data for each location in the domain (X-space) of the data. Optionally, GSMILES can add a few points (degrees of freedom) of data to each simplex to form a complex. However, too few points can cause "data clumping" and too many points can compromise the efficacy of GSMILES. Data clumping occurs when the decay constant is too high for a particular data location of a data point or cluster of data points, such that it tends to be isolated from the rest of the data and cannot link properly due to insufficient overlap with other nodes. This results in a spike node at that location that cannot interpolate or predict properly within its adjacent domain region. In summary, data clumping can be localized, as with singular data points, or it can be more global in terms of distribution of data clusters.
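
The decay-constant rule just described can be sketched with Euclidean distance as the dissimilarity and the m+1 nearest neighbors standing in for the adjacent domain simplex (both are simplifying assumptions):

```python
import numpy as np

def decay_constants(X):
    """For each data location, find the m+1 nearest other points (a
    stand-in for the adjacent simplex in an m-dimensional working
    domain) and set the decay constant to the inverse of the largest
    dissimilarity to them, so normalized dissimilarity is unity at
    the simplex edge."""
    n, m = X.shape
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)               # exclude self-distance
    nearest = np.sort(d, axis=1)[:, :m + 1]   # the m+1 adjacent points
    return 1.0 / nearest[:, -1]               # inverse of the largest
```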

While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation, system, process, process step or steps, algorithm, hardware or software, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto.

1-26. (canceled)
27. A method of generating a predictor model for predicting multivariable outcomes (a matrix of rows of Y-profiles) based upon multivariable inputs (a matrix of rows of X-profiles) with consideration of nuisance variables, said method comprising the steps of: analyzing each X-profile row of multivariable inputs as an object; calculating similarity among the objects; selecting tent poles determined to be critical profiles in supporting a prediction function for predicting the Y-profiles; optimizing the number of tent poles to minimize the error between the X-profiles and the Y-profiles; and performing at least one of storing and outputting a prediction function for predicting the Y-profiles that results from said analyzing, calculating, selecting and optimizing, wherein said Y-profiles are calculatable for continuous variables, logistic variables and ordinal variables.

28-35. (canceled)